IODS Course Project

Introduction

Hi, I am Aarthi Ravindran, a PhD student at the University of Eastern Finland. I heard about this course from the AIVI coordinators, who circulated an email within UEF. In today's science, we have an abundance of open data. If we learn the art of working with open data, we can put it to use in our own research. Click the links below to see my GitHub repository and course diary webpage.

Aarthi Ravindran Github repository
Aarthi Ravindran Course Diary


RStudio Exercise 1: Tasks and Instructions

This week was the introduction class, so it was all about getting familiar with RStudio, GitHub and DataCamp. I completed the following tasks:
1. Installation of tools
* R
* RStudio
* GitHub
2. Creating course templates
* GitHub
* RStudio
3. Learning R on DataCamp
* Part 1: Hello R
* Part 2: Getting Hooked on R
4. RStudio Exercise 1: Tasks and Instructions
* In this exercise, the IODS-project repository was forked from GitHub into my own GitHub account. It was then opened in RStudio to edit index.Rmd, chapter1.Rmd and chapter2.Rmd and fill in the required information. The edited files were saved and knitted, then committed (noting the changes in each commit message) and pushed to GitHub from RStudio. Once the push is done, refresh the GitHub project page and look for the changes you have made.

RStudio Exercise 2: Regression and Model Validation

This week we started with data wrangling; regression and model validation were then carried out on the resulting data.

1. Data Wrangling (Data Subsetting)
Here, the data is subset into a simpler data set for the regression analysis: the original 183 rows and 60 columns are reduced to 166 rows and 7 columns.
The subset data is stored at https://github.com/AarthiRavindran/IODS-project/data/
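The wrangling step itself is not shown in this diary, so below is a minimal, hypothetical sketch of the idea only. The raw column names and their groupings here are invented for illustration, not the actual survey items: multi-item question groups are averaged into single columns and students with zero exam points are dropped.

```r
# Hypothetical sketch of the subsetting idea; raw column names are invented here.
raw <- data.frame(
  gender = c("F", "M"), age = c(21, 25), Attitude = c(37, 31),
  D03 = c(4, 3), D06 = c(3, 2),     # assumed "deep" question items
  ST01 = c(3, 3), ST04 = c(4, 2),   # assumed "strategic" question items
  SU02 = c(2, 4), SU05 = c(3, 3),   # assumed "surface" question items
  points = c(25, 0)
)
raw$attitude <- raw$Attitude / 10                # rescale the 10-item sum to its mean
raw$deep <- rowMeans(raw[, c("D03", "D06")])     # average each question group
raw$stra <- rowMeans(raw[, c("ST01", "ST04")])
raw$surf <- rowMeans(raw[, c("SU02", "SU05")])
learning2014 <- raw[raw$points > 0,
                    c("gender", "age", "attitude", "deep", "stra", "surf", "points")]
dim(learning2014)  # 7 columns remain; zero-point students are dropped
```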

Let’s read the data and explore its dimensions and structure.

# Exercise:1 Data Analysis
# Read the students2014 data into R either from your local folder (if you completed the Data wrangling part) or from this url: http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt . (The separator is a comma "," and the file includes a header). Explore the structure and the dimensions of the data and describe the dataset briefly, assuming the reader has no previous knowledge of it. There is information related to the data here. (0-2 points)
#Read the data
setwd("~/IODS-project/")
library(dplyr)
learning2014 <- read.table("~/IODS-project/data/learning2014.txt")

Explore the data with the dim(), str(), head() and summary() commands:

# Explore the data
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
head(learning2014)
##   gender age attitude     deep  stra     surf points
## 1      F  53      3.7 3.583333 3.375 2.583333     25
## 2      M  55      3.1 2.916667 2.750 3.166667     12
## 3      F  49      2.5 3.500000 3.625 2.250000     24
## 4      M  53      3.5 3.500000 3.125 2.250000     10
## 5      M  49      3.7 3.666667 3.625 2.833333     22
## 6      F  38      3.8 4.750000 3.625 2.416667     21
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

Exploring the data shows columns for gender, age, attitude, deep learning questions, strategic learning questions, surface learning questions and exam points. The data has 166 rows and 7 columns, with 110 females and 56 males, a minimum age of 17 and a maximum age of 55.

Graphical Overview

#Question: 2
#Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-3 points)
pairs(learning2014[!names(learning2014) %in% c("gender")],col=learning2014$gender)

#install.packages("GGally")
#install.packages("ggplot2")
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(ggplot2)

# create a more advanced plot matrix with ggpairs()
ggpairs(learning2014, 
        mapping = aes(col = gender, alpha = 0.3), 
        lower = list(combo = wrap("facethist", bins = 20))
        )

Interpretation: We can see that the proportion of females is higher, and that the minimum and maximum age differ between genders. Attitude, deep and stra are positively correlated with age, whereas surf and points are negatively correlated with it; the same can be read off for each pair of variables. The highest correlation is between points and attitude, while deep and surf show the strongest negative correlation with each other.

2. Linear Regression

#Question: 3
#Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it. (0-4 points)  
#Question:4
#Using a summary of your fitted model, explain the relationship between the chosen explanatory variables and the target variable (interpret the model parameters). Explain and interpret the multiple R squared of the model. (0-3 points)

Let’s take the negative correlation between deep and surf and look at it more closely.

qplot(surf, deep, data= learning2014) + geom_smooth(method = "lm")

Let’s check how a linear model fits this data. The equation for the model is \[ Y_i = \alpha + \beta_1 X_i + \epsilon_i \] where
Y = surf
X = deep
\(\alpha\) = intercept (constant)
\(\beta_1\) = regression coefficient for deep
\(\epsilon\) = random error term.

Estimation of the model yields the following results:

my_model1 <- lm(surf ~ deep, data = learning2014)
results1 <- summary(my_model1)
knitr::kable(results1$coefficients, digits=3, caption="Regression coefficients")
Regression coefficients

             Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)     3.924        0.262    14.958          0
deep           -0.309        0.071    -4.383          0

Both the intercept and deep are statistically significant predictors (p < 0.001). The coefficient of determination \(R^2\) = 0.1048477, which is not particularly high.
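The \(R^2\) values quoted in this diary come from the summary object of the fitted model; a self-contained illustration on R's built-in cars data (unrelated to the course data, used here only so the snippet runs on its own):

```r
# summary() of an lm fit stores the multiple R-squared in the $r.squared slot
fit <- lm(dist ~ speed, data = cars)   # built-in dataset, for illustration only
r2 <- summary(fit)$r.squared
round(r2, 3)                           # about 0.65 for this dataset
```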

From the graphs above we can see that the highest correlation is between points and attitude. Let’s take them for further analysis.

qplot(attitude, points, data = learning2014) + geom_smooth(method = "lm")

Let’s now fit a linear model for points and attitude, the positively correlated pair with the highest correlation.
The equation for the model is \[ Y_i = \alpha + \beta_1 X_i + \epsilon_i \] where
Y = points
X = attitude
\(\alpha\) = intercept (constant)
\(\beta_1\) = regression coefficient for attitude
\(\epsilon\) = random error term.

Estimation of the model yields the following results:

my_model <- lm(points ~ attitude, data = learning2014)
results <- summary(my_model)
knitr::kable(results$coefficients, digits=3, caption="Regression coefficients")
Regression coefficients

             Estimate   Std. Error   t value   Pr(>|t|)
(Intercept)    11.637        1.830     6.358          0
attitude        3.525        0.567     6.214          0

Both the intercept and attitude are statistically significant predictors. The coefficient of determination \(R^2\) = 0.1905537, which is still not particularly high; some additional predictors could probably be added to both of the models studied here.

Diagnostic Plots

#Question 5
#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage. Explain the assumptions of the model and interpret the validity of those assumptions based on the diagnostic plots. (0-3 points)   
#Taking highly correlated variables for diagnostic plot
plot(my_model, which=c(1,2,5))

Result: The diagnostic plots reveal outliers. If we removed them, we might obtain a better R-squared value and the model might fit more efficiently.
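As a sketch of that idea (on simulated data, not the course data): drop observations with large standardized residuals and refit; \(R^2\) typically improves.

```r
# Simulated example: one injected outlier, removed via standardized residuals
set.seed(2)
x <- rnorm(50)
y <- 2 + 3 * x + rnorm(50)
y[1] <- y[1] + 20                       # inject a single outlier
fit  <- lm(y ~ x)
ok   <- abs(rstandard(fit)) < 3         # keep |standardized residual| < 3
fit2 <- lm(y[ok] ~ x[ok])
c(before = summary(fit)$r.squared, after = summary(fit2)$r.squared)
```

Whether removal is justified is a judgment call; an influential point should be checked against the data source before being discarded.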


RStudio Exercise 3: Tasks and Instructions

The third week's class is about logistic regression.

Logistic Regression:
Logistic regression is a statistical model that uses the logistic function to model a binary dependent variable. Logistic (or logit) regression estimates parameters for a binary outcome (0 and 1), whereas, as we saw in the previous exercise, linear regression estimates parameters for a continuous outcome (from -infinity to +infinity).
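A minimal sketch of fitting a logistic regression in R with glm() (simulated data; the variable names only echo the course data and the slope 0.3 is an assumption of the simulation):

```r
# Simulate a binary outcome whose log-odds rise with absences (assumed slope 0.3)
set.seed(1)
absences <- rpois(300, 4)
high_use <- rbinom(300, 1, plogis(-2 + 0.3 * absences))
m <- glm(high_use ~ absences, family = binomial(link = "logit"))
coef(summary(m))           # estimate, std. error, z value, p-value per term
exp(coef(m)["absences"])   # odds ratio for one extra absence
```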

Data wrangling (max 5 points)
1. Take the data files student-mat.csv and student-por.csv from the given link.
2. Create a new R script 'create_alc.R' and save it in the data folder.
3. Read both student-mat.csv and student-por.csv and explore the structure and dimensions of the data. (1 point - Done)
4. Join the two data sets using the variables "school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet" as (student) identifiers. Keep only the students present in both data sets. Explore the structure and dimensions of the joined data. (1 point - Done)
5. Either a) copy the solution from the DataCamp exercise The if-else structure to combine the 'duplicated' answers in the joined data, or b) write your own solution to achieve this task. (1 point - Done)
6. Take the average of the answers related to weekday and weekend alcohol consumption to create a new column 'alc_use' in the joined data. Then use 'alc_use' to create a new logical column 'high_use' which is TRUE for students whose 'alc_use' is greater than 2 (and FALSE otherwise). (1 point - Done)
7. Glimpse at the joined and modified data to make sure everything is in order. The joined data should now have 382 observations of 35 variables. Save the joined and modified data set to the 'data' folder, using for example the write.csv() or write.table() functions. (1 point - Done)

Aarthi Ravindran, 13th September 2019

Data Wrangling - 5 points
1. Take the data files student-mat.csv and student-por.csv from the given link.
2. Create a new R script 'create_alc.R' and save it in the data folder.
3. Read both student-mat.csv and student-por.csv and explore the structure and dimensions of the data.

# Reading the files
student_mat <- read.csv("C:/Users/Aarthi/Documents/IODS-project/data/student-mat.csv", sep = ";", header = TRUE)
student_por <- read.csv("C:/Users/Aarthi/Documents/IODS-project/data/student-por.csv", sep = ";", header = TRUE)
# Explore the files
dim(student_mat)
## [1] 395  33
dim(student_por)
## [1] 649  33
str(student_mat)
## 'data.frame':    395 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3        : int  6 6 10 15 10 15 11 6 19 15 ...
str(student_por)
## 'data.frame':    649 obs. of  33 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3        : int  11 11 12 14 13 13 13 13 17 13 ...
#Join the data
library(dplyr)
join_by <- c("school", "sex", "age", "address", "famsize", "Pstatus", "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery","internet")
student_mat_por_joined <- inner_join(student_mat, student_por, by=join_by)
dim(student_mat_por_joined)
## [1] 382  53
str(student_mat_por_joined)
## 'data.frame':    382 obs. of  53 variables:
##  $ school      : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex         : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age         : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address     : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize     : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus     : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu        : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu        : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob        : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob        : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason      : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ guardian.x  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime.x: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime.x : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures.x  : int  0 0 3 0 0 0 0 0 0 0 ...
##  $ schoolsup.x : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup.x    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid.x      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities.x: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ nursery     : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ higher.x    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ internet    : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ romantic.x  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel.x    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime.x  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout.x     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc.x      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc.x      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health.x    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences.x  : int  6 4 10 2 4 10 0 6 0 0 ...
##  $ G1.x        : int  5 5 7 15 6 15 12 6 16 14 ...
##  $ G2.x        : int  6 5 8 14 10 15 12 5 18 15 ...
##  $ G3.x        : int  6 6 10 15 10 15 11 6 19 15 ...
##  $ guardian.y  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime.y: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime.y : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures.y  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ schoolsup.y : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup.y    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid.y      : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
##  $ activities.y: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher.y    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic.y  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel.y    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime.y  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout.y     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc.y      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc.y      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health.y    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences.y  : int  4 2 6 0 0 6 0 2 0 0 ...
##  $ G1.y        : int  0 9 12 14 11 12 13 10 15 12 ...
##  $ G2.y        : int  11 11 13 14 13 12 12 13 16 12 ...
##  $ G3.y        : int  11 11 12 14 13 13 13 13 17 13 ...
glimpse(student_mat_por_joined)
## Observations: 382
## Variables: 53
## $ school       <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, G...
## $ sex          <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F...
## $ age          <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 1...
## $ address      <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U...
## $ famsize      <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3,...
## $ Pstatus      <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T...
## $ Medu         <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4...
## $ Fedu         <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4...
## $ Mjob         <fct> at_home, at_home, at_home, health, other, service...
## $ Fjob         <fct> teacher, other, other, services, other, other, ot...
## $ reason       <fct> course, course, other, home, home, reputation, ho...
## $ guardian.x   <fct> mother, father, mother, mother, father, mother, m...
## $ traveltime.x <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1...
## $ studytime.x  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3...
## $ failures.x   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.x  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no...
## $ famsup.x     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, ye...
## $ paid.x       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes...
## $ activities.x <fct> no, no, no, yes, no, yes, no, no, no, yes, no, ye...
## $ nursery      <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ...
## $ higher.x     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
## $ internet     <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, ye...
## $ romantic.x   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, ...
## $ famrel.x     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3...
## $ freetime.x   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2...
## $ goout.x      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3...
## $ Dalc.x       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.x       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2...
## $ health.x     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2...
## $ absences.x   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4,...
## $ G1.x         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10...
## $ G2.x         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10...
## $ G3.x         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 1...
## $ guardian.y   <fct> mother, father, mother, mother, father, mother, m...
## $ traveltime.y <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1...
## $ studytime.y  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3...
## $ failures.y   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
## $ schoolsup.y  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no...
## $ famsup.y     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, ye...
## $ paid.y       <fct> no, no, no, no, no, no, no, no, no, no, no, no, n...
## $ activities.y <fct> no, no, no, yes, no, yes, no, no, no, yes, no, ye...
## $ higher.y     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes,...
## $ romantic.y   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, ...
## $ famrel.y     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3...
## $ freetime.y   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2...
## $ goout.y      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3...
## $ Dalc.y       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ Walc.y       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2...
## $ health.y     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2...
## $ absences.y   <int> 4, 2, 6, 0, 0, 6, 0, 2, 0, 0, 2, 0, 0, 0, 0, 6, 1...
## $ G1.y         <int> 0, 9, 12, 14, 11, 12, 13, 10, 15, 12, 14, 10, 12,...
## $ G2.y         <int> 11, 11, 13, 14, 13, 12, 12, 13, 16, 12, 14, 12, 1...
## $ G3.y         <int> 11, 11, 12, 14, 13, 13, 13, 13, 17, 13, 14, 13, 1...
# result interpretation: exploring the joined data shows that the duplicated columns were joined with .x and .y suffixes, so we have to combine those duplicates
#To combine the duplicates
# create a new data frame with only the joined columns
alc <- select(student_mat_por_joined, one_of(join_by))

# columns that were not used for joining the data
notjoined_columns <- colnames(student_mat)[!colnames(student_mat) %in% join_by]

# print out the columns not used for joining
notjoined_columns
##  [1] "guardian"   "traveltime" "studytime"  "failures"   "schoolsup" 
##  [6] "famsup"     "paid"       "activities" "higher"     "romantic"  
## [11] "famrel"     "freetime"   "goout"      "Dalc"       "Walc"      
## [16] "health"     "absences"   "G1"         "G2"         "G3"
# for every column name not used for joining...
for(column_name in notjoined_columns) {
  # select two columns from 'math_por' with the same original name
  two_columns <- select(student_mat_por_joined, starts_with(column_name))
  # select the first column vector of those two columns
  first_column <- select(two_columns, 1)[[1]]
  
  # if that first column  vector is numeric...
  if(is.numeric(first_column)) {
    # take a rounded average of each row of the two columns and
    # add the resulting vector to the alc data frame
    alc[column_name] <- round(rowMeans(two_columns))
  } else { # else if it's not numeric...
    # add the first column vector to the alc data frame
    alc[column_name] <- first_column
  }
}

# glimpse at the new combined data
glimpse(alc)
## Observations: 382
## Variables: 33
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
# Take the average of weekday and weekend alcohol consumption as a new column alc_use
alc <- mutate(alc, alc_use = (Dalc + Walc) / 2)
# Create a new logical column high_use which is TRUE for students whose alc_use is greater than 2
alc <- mutate(alc, high_use = alc_use > 2)
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <dbl> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <dbl> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <dbl> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <dbl> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <dbl> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <dbl> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <dbl> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <dbl> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <dbl> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1         <dbl> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2         <dbl> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3         <dbl> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

write.csv(alc, "C:/Users/Aarthi/Documents/IODS-project/data/new_data.csv")
write.table(alc, "C:/Users/Aarthi/Documents/IODS-project/data/new_data.txt")

Analysis (max 15 points)
1. Create 'chapter3.Rmd' and include it in 'index.Rmd' to perform the analysis in the chapter3.Rmd file. (Done)
2. Read the joined data, print the variables in the data and describe it. (0-1 point) (Done)
3. Select variables to study their relationships with high/low alcohol consumption in the data. (0-1 point) (Done)
4. Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption. (0-5 points)
5. Use logistic regression to statistically explore the relationship between your chosen variables and the binary high/low alcohol consumption variable as the target variable. (0-5 points) (Done)
6. Using the variables which, according to your logistic regression model, had a statistical relationship with high/low alcohol consumption, explore the predictive power of your model. (0-3 points) (Done)
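For step 6, the predictive power can be checked with a 2x2 cross tabulation of predictions against observations. A hedged sketch on simulated data (the variable names only mirror the course data):

```r
# Confusion table and training error for a fitted logistic model (simulated data)
set.seed(3)
d <- data.frame(absences = rpois(300, 4))
d$high_use <- runif(300) < plogis(-1.5 + 0.25 * d$absences)
m <- glm(high_use ~ absences, data = d, family = binomial)
d$prob <- predict(m, type = "response")
table(observed = d$high_use, predicted = d$prob > 0.5)  # confusion matrix
mean(d$high_use != (d$prob > 0.5))                      # proportion misclassified
```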

Start of analysis - About the data:
This data concerns student alcohol consumption. The students present in two different data sets were taken and combined. The joined data set has 382 students and describes their school, sex, age, address, family size, parents' cohabitation status, mother's and father's education, mother's and father's job, activities, weekday and weekend alcohol consumption, health, average alcohol use, and a column indicating whether their alcohol use is high.

alc <- read.table("C:/Users/Aarthi/Documents/IODS-project/data/new_data.txt", sep = " ")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The variables that, from my point of view, might be associated with alcohol consumption are sex, health, absences, failures and activities.

#install.packages("ggplot2")
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble  2.1.3     v purrr   0.3.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(ggplot2)
gather(alc) %>% glimpse
## Warning: attributes are not identical across measure variables;
## they will be dropped
## Observations: 13,370
## Variables: 2
## $ key   <chr> "school", "school", "school", "school", "school", "schoo...
## $ value <chr> "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "GP", "G...
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
## Warning: attributes are not identical across measure variables;
## they will be dropped

alc %>% group_by(sex, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 2 x 4
## # Groups:   sex [2]
##   sex   high_use count mean_grade
##   <fct> <lgl>    <int>      <dbl>
## 1 F     TRUE       198       11.5
## 2 M     TRUE       184       11.5
alc %>% group_by(activities, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 2 x 4
## # Groups:   activities [2]
##   activities high_use count mean_grade
##   <fct>      <lgl>    <int>      <dbl>
## 1 no         TRUE       181       11.2
## 2 yes        TRUE       201       11.7
alc %>% group_by(health, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 5 x 4
## # Groups:   health [5]
##   health high_use count mean_grade
##    <int> <lgl>    <int>      <dbl>
## 1      1 TRUE        46       12.6
## 2      2 TRUE        44       11.5
## 3      3 TRUE        81       11.3
## 4      4 TRUE        67       11.4
## 5      5 TRUE       144       11.2
alc %>% group_by(absences, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 26 x 4
## # Groups:   absences [26]
##    absences high_use count mean_grade
##       <int> <lgl>    <int>      <dbl>
##  1        0 TRUE        65       12.0
##  2        1 TRUE        51       11.6
##  3        2 TRUE        58       11.4
##  4        3 TRUE        41       11.7
##  5        4 TRUE        36       11.5
##  6        5 TRUE        22       11.3
##  7        6 TRUE        21       12.1
##  8        7 TRUE        12       11  
##  9        8 TRUE        20       11.2
## 10        9 TRUE        12       10.5
## # ... with 16 more rows
alc %>% group_by(failures, high_use) %>% summarise(count = n(), mean_grade = mean(G3))
## # A tibble: 4 x 4
## # Groups:   failures [4]
##   failures high_use count mean_grade
##      <int> <lgl>    <int>      <dbl>
## 1        0 TRUE       334      11.9 
## 2        1 TRUE        24       8.67
## 3        2 TRUE        19       7.53
## 4        3 TRUE         5       7.4
g1 <- ggplot(alc, aes(x = high_use, y = G3, col = sex))
g1 + geom_boxplot() + ylab("grade")

g2 <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g2 + geom_boxplot() + ggtitle("Student absences by alcohol consumption and sex")

Next, explore the relationship between the chosen variables and alcohol consumption.
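Before fitting a model, a quick numeric check with cor() shows how strongly the candidate variables move together with alcohol use. This is a sketch on hypothetical toy values (the column names mirror the real alc data, but the numbers are made up for illustration):

```r
# Toy illustration (hypothetical values, not the real 'alc' data):
# cor() measures how strongly two numeric variables move together.
alc_toy <- data.frame(
  alc_use  = c(1, 1, 2.5, 4, 1.5, 5),
  absences = c(0, 2, 5, 10, 1, 12),
  failures = c(0, 0, 1, 2, 0, 3)
)
round(cor(alc_toy), 2)  # pairwise correlation matrix; all positive here
```

With the real data, the same call on the numeric columns of alc gives a first impression of which predictors are worth putting into the logistic regression.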

m <- glm(high_use ~ failures + absences, data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## 2.409e-06  2.409e-06  2.409e-06  2.409e-06  2.409e-06  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)
## (Intercept)  2.657e+01  2.422e+04   0.001    0.999
## failures    -3.110e-08  3.136e+04   0.000    1.000
## absences     1.430e-09  3.347e+03   0.000    1.000
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 0.0000e+00  on 381  degrees of freedom
## Residual deviance: 2.2162e-09  on 379  degrees of freedom
## AIC: 6
## 
## Number of Fisher Scoring iterations: 25
coef(m)
##   (Intercept)      failures      absences 
##  2.656607e+01 -3.109885e-08  1.429756e-09
#Find the logistic regression of variables failures, absences, sex, health, activities
library(dplyr)
m <- glm(high_use ~ sex + absences, data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
is.na(m)
##      coefficients         residuals     fitted.values           effects 
##             FALSE             FALSE             FALSE             FALSE 
##                 R              rank                qr            family 
##             FALSE             FALSE             FALSE             FALSE 
## linear.predictors          deviance               aic     null.deviance 
##             FALSE             FALSE             FALSE             FALSE 
##              iter           weights     prior.weights       df.residual 
##             FALSE             FALSE             FALSE             FALSE 
##           df.null                 y         converged          boundary 
##             FALSE             FALSE             FALSE             FALSE 
##             model              call           formula             terms 
##             FALSE             FALSE             FALSE             FALSE 
##              data            offset           control            method 
##             FALSE             FALSE             FALSE             FALSE 
##         contrasts           xlevels 
##             FALSE             FALSE
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
#CI <- confint(m) %>% exp

# print out the odds ratios with their confidence intervals
#cbind(OR, CI)
# fit the model
m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
## Warning: glm.fit: algorithm did not converge
# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# see the last ten original classes, predicted probabilities, and class predictions
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
##     failures absences sex high_use probability prediction
## 373        1        0   M     TRUE           1       TRUE
## 374        1        7   M     TRUE           1       TRUE
## 375        0        1   F     TRUE           1       TRUE
## 376        0        6   F     TRUE           1       TRUE
## 377        1        2   F     TRUE           1       TRUE
## 378        0        2   F     TRUE           1       TRUE
## 379        2        2   F     TRUE           1       TRUE
## 380        0        3   F     TRUE           1       TRUE
## 381        0        4   M     TRUE           1       TRUE
## 382        0        2   M     TRUE           1       TRUE
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use TRUE
##     TRUE  382

Based on this model, sex, absences and failures appear to be associated with high/low alcohol consumption. Note, however, that glm() warned that the algorithm did not converge and every row in the prediction table is high_use = TRUE, which suggests the data was not read back in correctly (for example, a missing header argument in read.table); the predictions should therefore be interpreted with caution.
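The predictive power of a binary classifier can be summarised by cross-tabulating targets against predictions and computing the share of wrong predictions. A minimal sketch with toy vectors (with the real data, substitute alc$high_use and alc$prediction):

```r
# Toy target and prediction vectors (hypothetical, for illustration only)
high_use   <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
prediction <- c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)

table(high_use, prediction)   # 2x2 confusion matrix
mean(high_use != prediction)  # proportion of wrong predictions: 2/6 = 1/3
```

The same mean() call is the training error of the model; a useful classifier should beat a trivial guessing strategy such as always predicting the majority class.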


Clustering and classification

Aarthi Ravindran, 20/11/2019 - RStudio Exercise 4: Tasks and Instructions. Note: for the Exercise 5 data wrangling, please check the link below.

Analysis Exercises - 15 marks

  1. Create chapter4.Rmd file and include the child file in ‘index.Rmd’ file.
  2. Load the Boston data from the MASS package and explore the data. (0-1 points)
  3. Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them. (0-2 points)
#load Boston data from Mass package (Exercise 4.2)

#The following R packages are installed already, which are mandatory to run this analysis
#install.packages("MASS")
#install.packages("corrr")
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(tidyr)
library(corrr)
library(dplyr)
#install.packages("corrplot")
library(corrplot)
## corrplot 0.84 loaded
#if(!require(devtools)) install.packages("devtools")
#devtools::install_github("ggobi/ggally")
#install.packages("GGally")
library(GGally)
data(Boston)

#Explore the Boston data (Exercise 4.2)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506  14
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
#Graphical Overview of data (Exercise 4.3)
pairs(Boston)

#Calculate the correlation of variables in the matrix
cor_matrix <- cor(Boston)%>%round(digits = 2)
#print correlation matrix
cor_matrix
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
#visualize the correlation matrix
corrplot(cor_matrix, method="circle", type="upper", cl.pos="b", tl.pos="d", tl.cex = 0.6)

#Another package for correlation matrix
# calculate the correlation matrix and round it
cor_matrix_gg <- ggcorr(Boston, palette = "RdBu", label = TRUE)
cor_matrix_gg

*Description of the Boston data:*
The Boston dataset describes housing values in the suburbs of Boston. It has 506 rows and 14 columns (from dim()); the columns are:
crim - per capita crime rate by town,
zn - proportion of residential land zoned for lots over 25,000 sq.ft.,
indus - proportion of non-retail business acres per town,
chas - Charles River dummy variable (= 1 if tract bounds river; 0 otherwise),
nox - nitrogen oxides concentration (parts per 10 million),
rm - average number of rooms per dwelling,
age - proportion of owner-occupied units built prior to 1940,
dis - weighted mean of distances to five Boston employment centres,
rad - index of accessibility to radial highways,
tax - full-value property-tax rate per $10,000,
ptratio - pupil-teacher ratio by town,
black - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town,
lstat - lower status of the population (percent),
medv - median value of owner-occupied homes in $1000s.

Boston Data Variables: Interpretation from Data Exploration and Graphical Plots:
The str() command shows that all variables in the data are numeric or integer.
From summary() we can see the minimum, maximum, median, and 1st and 3rd quartiles of each variable. For example, the crime rate ranges from a minimum of 0.006 to a maximum of 88.97; the proportion of non-retail business acres peaks at 27.74; dwellings have between roughly 3.6 and 8.8 rooms on average; the proportion of owner-occupied units built prior to 1940 ranges from 2.9 to 100; and so on.

The relationship between two variables can be described by the correlation between them. Briefly: when two sets of values are strongly linked, they have a high correlation. The word correlation is made of co- (meaning "together") and relation. A correlation is positive when the values increase together, and negative when one value decreases as the other increases.
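Positive and negative correlation in two lines of R (a toy sketch with constructed values):

```r
x <- 1:10
y_up   <- 2 * x + 3   # increases together with x
y_down <- 20 - x      # decreases as x increases

cor(x, y_up)    # 1: perfect positive correlation
cor(x, y_down)  # -1: perfect negative correlation
```

Real variables rarely reach exactly 1 or -1; values near 0 mean the two variables are essentially unrelated linearly.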

From the graphical plots pairs(), corrplot() and ggcorr(), which visualize the correlation matrix, we can see that some of the variables are correlated either positively or negatively.

For example, from the correlation plots we can clearly see that:
crim is positively correlated with rad and tax,
zn is negatively correlated with indus, nox and age, and positively correlated with dis,
nox is positively correlated with age, rad, tax and lstat, and negatively correlated with dis and medv, and so on.

The corrplot shows these relationships more clearly than pairs(), and ggcorr() also prints the correlation values.
########################################################################################################################

  1. Standardize the dataset and print out summaries of the scaled data. How did the variables change? Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset. Divide the dataset to train and test sets, so that 80% of the data belongs to the train set. (0-2 points)
#standardize the dataset
boston_scaled <- scale(Boston) # centers each column to mean 0 and scales it to sd 1
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
class(boston_scaled)
## [1] "matrix"
boston_scaled <- as.data.frame(boston_scaled)

#Create the catagorical variable of crime rate from scaled data
#create the quantile vector
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
#create the categorical variable of crime
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# boston_scaled is available

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

#Check the structure of the new test data
str(boston_scaled)
## 'data.frame':    506 obs. of  14 variables:
##  $ zn     : num  0.285 -0.487 -0.487 -0.487 -0.487 ...
##  $ indus  : num  -1.287 -0.593 -0.593 -1.306 -1.306 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.144 -0.74 -0.74 -0.834 -0.834 ...
##  $ rm     : num  0.413 0.194 1.281 1.015 1.227 ...
##  $ age    : num  -0.12 0.367 -0.266 -0.809 -0.511 ...
##  $ dis    : num  0.14 0.557 0.557 1.077 1.077 ...
##  $ rad    : num  -0.982 -0.867 -0.867 -0.752 -0.752 ...
##  $ tax    : num  -0.666 -0.986 -0.986 -1.105 -1.105 ...
##  $ ptratio: num  -1.458 -0.303 -0.303 0.113 0.113 ...
##  $ black  : num  0.441 0.441 0.396 0.416 0.441 ...
##  $ lstat  : num  -1.074 -0.492 -1.208 -1.36 -1.025 ...
##  $ medv   : num  0.16 -0.101 1.323 1.182 1.486 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
dim(boston_scaled)
## [1] 506  14

Scaling Results: Scaling the data is important for this analysis: it centers each variable to mean 0 and standard deviation 1, so that variables measured on very different scales become comparable and no single variable dominates the model.
As a result of scaling we can see changes in the maximum, minimum, 1st and 3rd quartiles, mean and median of the variables.
For example, the max() of the crime rate decreased from 88.9 to 9.92, and every mean is now exactly 0.
Since all variables in this data are numeric/integer, we convert the crime rate into a categorical variable and split the data into train and test sets to check how well the model works. Here 80% of the data belongs to the train set, obtained by sampling 80% of the rows.
As a result of converting the crime rate into a categorical variable, the data falls into four groups: 127 rows with low, 126 with medium-low, 126 with medium-high and 127 with high crime rate. The data is divided almost equally across the crime-rate classes.
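The binning-and-splitting recipe above can be sketched on toy data; note that set.seed() makes the random sample() reproducible, which the analysis above does not do (so its split differs between runs):

```r
set.seed(123)          # reproducible random split
x <- rnorm(200)        # toy numeric variable standing in for scaled crim

bins <- quantile(x)    # 0%, 25%, 50%, 75%, 100% break points
grp <- cut(x, breaks = bins, include.lowest = TRUE,
           labels = c("low", "med_low", "med_high", "high"))
table(grp)             # 50 observations per class

ind <- sample(length(x), size = length(x) * 0.8)
train <- x[ind]        # 160 observations (80%)
test  <- x[-ind]       # the remaining 40 observations (20%)
```

Because the break points are quantiles, the four classes are (up to ties) equally sized by construction.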
#############################################################################################################################

  1. Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot. (0-3 points)

Discussion: Before going into Exercise 4.5, we have to check whether the data has a categorical variable. From str() we can clearly see that the variable "crime" is a factor with four levels (low, med_low, med_high, high), which is required as the target for Linear Discriminant Analysis.

Linear Discriminant Analysis:
Linear Discriminant Analysis (LDA) is a dimensionality reduction technique. As the name implies dimensionality reduction techniques reduce the number of dimensions (i.e. variables) in a dataset while retaining as much information as possible. Discriminant analysis is used when groups are known a priori (unlike in cluster analysis). Each case must have a score on one or more quantitative predictor measures, and a score on a group measure.

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train) #this train data set has 80% of data

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2500000 0.2425743 0.2450495 0.2623762 
## 
## Group means:
##                  zn      indus        chas        nox         rm
## low       1.0034832 -0.9010948 -0.15538550 -0.8842032  0.4055971
## med_low  -0.1243159 -0.2535782  0.04906687 -0.5657898 -0.1394810
## med_high -0.3729012  0.1569535  0.12535782  0.3284715  0.2583587
## high     -0.4872402  1.0170298 -0.04947434  1.0559544 -0.4561416
##                 age        dis        rad        tax      ptratio
## low      -0.9119357  0.9125446 -0.6964601 -0.7646435 -0.442738112
## med_low  -0.3405873  0.3311928 -0.5482663 -0.4678459  0.009698635
## med_high  0.4112640 -0.3769823 -0.4250387 -0.3297825 -0.319124488
## high      0.7959571 -0.8489872  1.6390172  1.5146914  0.781811640
##                black       lstat        medv
## low       0.37477549 -0.76649085  0.51801342
## med_low   0.33076887 -0.14040701 -0.01754658
## med_high  0.07773025 -0.05975297  0.29692282
## high     -0.86082849  0.92236340 -0.71318334
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.08787571  0.71158379 -0.84914296
## indus    0.05561741 -0.20691684  0.22119777
## chas    -0.07120401 -0.05419421  0.22802986
## nox      0.42909737 -0.66152081 -1.30704071
## rm      -0.11280863 -0.15986064 -0.12935336
## age      0.19541078 -0.38010352 -0.18388871
## dis     -0.03708138 -0.24439676  0.05403422
## rad      3.38929531  1.02114614 -0.19140215
## tax      0.05784522 -0.13909744  0.62247473
## ptratio  0.08388646  0.02601791 -0.14504777
## black   -0.10899877  0.02359101  0.14841266
## lstat    0.23744567 -0.29956619  0.32455942
## medv     0.19570156 -0.43114176 -0.27244402
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9543 0.0339 0.0117
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "orange", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

LDA Discussion: The reduction plot clearly shows the data separating into clusters based on the crime classes. Some med_high observations cluster together with the high class rather than with their own, so it might be better to assign them to the high category. The biplot arrows also show how the other variables relate to the crime rate; for example, the variable rad is strongly associated with the high crime rate class.
#############################################################################################################################

  1. Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results. (0-3 points)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       17       9        0    0
##   med_low    8      14        6    0
##   med_high   1      12       12    2
##   high       0       0        0   21
#compare the class counts in the full data, the test set and the train set

#class counts in the full (scaled) data
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
#class counts in the test set (the correct classes)
table(correct_classes)
## correct_classes
##      low  med_low med_high     high 
##       26       28       27       21
#class counts in the train set used to fit the LDA
lda.fit$counts
##      low  med_low med_high     high 
##      101       98       99      106

Discussion of the predicted model: We divided the classes manually based on numeric ranges, whereas the LDA model predicts, for each test observation, the class that fits best given the other variables in the dataset. From the cross-tabulation we can see, for example, that of the 26 test observations in the low class, 17 were predicted correctly and 9 were misclassified as med_low; the high class was predicted perfectly (21 of 21), while the two middle classes were confused more often.
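The cross-tabulation above can be summarised in one number: correct predictions sit on the diagonal, so overall accuracy is their share of all test observations. Using the counts printed above:

```r
# Confusion matrix copied from the cross-tabulation above
conf <- matrix(c(17,  9,  0,  0,
                  8, 14,  6,  0,
                  1, 12, 12,  2,
                  0,  0,  0, 21),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

sum(diag(conf)) / sum(conf)  # overall accuracy: 64 / 102, about 0.63
```

So the model classifies roughly 63% of the test observations correctly, clearly better than the 25% expected from random guessing among four balanced classes.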
#################################################################################################################################

  1. Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results. (0-4 points)
#1. load the Boston data
library(MASS)
data('Boston')

#2. Scale the Boston Data and standardize the dataset
scaled_Boston <- scale(Boston)

#3. calculate the distances between the observations

# euclidean distance matrix
dist_eu <- dist(scaled_Boston)

# look at the summary of the distances
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
# manhattan distance matrix (note: computed here on the unscaled Boston data;
# for distances comparable to dist_eu it should use scaled_Boston)
dist_man <- dist(Boston, method = 'manhattan')

# look at the summary of the distances
summary(dist_man)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    2.016  149.145  279.505  342.899  509.707 1198.265
#without scaling we get the following answer
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#  1.119  85.624 170.539 226.315 371.950 626.047 
#    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
#   2.016  149.145  279.505  342.899  509.707 1198.265 
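The two distance measures differ in a simple way, illustrated here on a tiny example: for the points (0,0) and (3,4), the Euclidean distance is sqrt(3^2 + 4^2) = 5, while the Manhattan distance sums the coordinate differences, 3 + 4 = 7.

```r
pts <- rbind(c(0, 0), c(3, 4))

dist(pts)                        # Euclidean: 5
dist(pts, method = "manhattan")  # Manhattan: 7
```

Manhattan distance is never smaller than Euclidean, which is one reason the Manhattan summary above shows larger values.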

#4. Run K-means cluster
# k-means clustering
km <-kmeans(scaled_Boston, centers = 4)

# plot the Boston dataset with clusters
pairs(scaled_Boston, col = km$cluster)

Discussion:
From the above k-means clustering, the data is clustered based on the distances between observations. The pairs plot shows the distribution of the clusters across each variable. Here I divided the data into four centers, shown in four different colors in the picture. This clustering differs from the LDA analysis, which was supervised by the categorical crime variable, whereas k-means is unsupervised clustering based on distance.
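One common way to investigate the optimal number of clusters (which the exercise asks for) is the elbow method: compute the total within-cluster sum of squares (WCSS) for a range of k and look for the point where the curve bends sharply. Sketched here on toy data; with the real data, replace dat with scaled_Boston:

```r
set.seed(123)  # kmeans uses random starting centers, so fix the seed
dat <- scale(matrix(rnorm(400), ncol = 4))  # toy stand-in for scaled_Boston

k_max <- 10
twcss <- sapply(1:k_max, function(k) kmeans(dat, centers = k)$tot.withinss)

plot(1:k_max, twcss, type = "b",
     xlab = "number of clusters k", ylab = "total WCSS")
# a reasonable k is where the curve bends sharply (the "elbow")
```

WCSS always decreases as k grows, so the point of diminishing returns, rather than the minimum, is what guides the choice.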

Super-Bonus: Run the code below for the (scaled) train data that you used to fit the LDA. The code creates a matrix product, which is a projection of the data points.

model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#install.packages("plotly")
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)